
We received data on 2670 recommended locations, categorized into two types:
Type A: 2536 locations with an operational cost of 25,000 and available estimated sales data.
Type B: 134 locations with an operational cost of 15,000, without specific sales data but can cover nearby Type A locations within a 5 km radius."
dataFrameTypeA.head()
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | |
|---|---|---|---|---|---|---|
| 0 | 1 | الشرقية | حضر | 30.727543 | 31.793701 | 72021.026667 |
| 1 | 2 | القاهرة | حضر | 30.235685 | 31.467567 | 54690.637778 |
| 2 | 3 | المنوفية | حضر | 30.361963 | 30.512031 | 42379.708889 |
| 3 | 4 | أسوان | حضر | 24.977806 | 32.875934 | 42377.312222 |
| 4 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 |
dataFrameTypeB.head()
| ID | Gov | Latitude | Longitude | |
|---|---|---|---|---|
| 0 | 1 | الجيزة | 29.948291 | 31.051539 |
| 1 | 2 | الجيزة | 30.048963 | 30.974668 |
| 2 | 3 | الجيزة | 30.061117 | 30.960385 |
| 3 | 4 | الفيوم | 29.361996 | 30.683888 |
| 4 | 5 | الفيوم | 29.413400 | 30.862680 |
print(dataFrameTypeA.isna().sum())
print(dataFrameTypeB.isna().sum())
ID 0 Gov 0 City_Type 0 Latitude 0 Longitude 0 Estimated_Sales 0 dtype: int64 ID 0 Gov 0 Latitude 0 Longitude 0 dtype: int64
print(dataFrameTypeA.duplicated().sum())
print(dataFrameTypeB.duplicated().sum())
0 0
dataFrameTypeA["Type"]='A'
dataFrameTypeB["Type"]='B'
dataFrameTypeA["operational_costs"]=25000
dataFrameTypeB["operational_costs"]=15000
len(dataFrameTypeA["Gov"].unique())
27
len(dataFrameTypeB["Gov"].unique())
15
len(pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True)["Gov"].unique())
29
plt.figure(figsize=(8, 4))
plt.bar(reshaped_text, gov_counts, color='skyblue')
plt.title(get_display(arabic_reshaper.reshape('توزيع المواقع حسب المحافظة')))
plt.xlabel(get_display(arabic_reshaper.reshape('المحافظة')))
plt.ylabel(get_display(arabic_reshaper.reshape('عدد المواقع')))
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plotDataFrameLongLatWithType(pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True))
Latitude: This measures how far north or south a location is from the Equator, which is 0 degrees latitude. Latitude values range from 0 degrees at the Equator to 90 degrees north at the North Pole and 90 degrees south at the South Pole.
Longitude: This measures how far east or west a location is from the Prime Meridian, which is 0 degrees longitude,Longitude values range from 0 degrees at the Prime Meridian to 180 degrees east and 180 degrees west.
The Haversine formula is used to calculate the shortest distance between two points on the surface of a sphere, which in this case is the Earth. The distance calculated using the Haversine formula is called the great-circle distance.
Accuracy Over Long Distances: The Haversine formula accounts for the spherical shape of the Earth, making it accurate for long distances, unlike simple Euclidean distance which assumes a flat surface.
Option 1: Apply Clustering After Computing Estimated Sales
1-Compute Individual Sales Estimates:
For each Type A location, sum the estimated sales of all other Type A locations within 5 km.
For Type B locations, assign the highest nearby Type A sales estimate within 5 km.
2-Filter Locations Based on Sales vs. Costs:
Identify which locations (both Type A and Type B) have sales estimates that meet or exceed their respective operational costs.
3-Cluster Filtered Locations:
Apply a clustering algorithm to the filtered list of locations (those that have sales estimates covering operational costs) to group them based on their geographic coordinates with a 5 km distance parameter.
Option 2: Apply Clustering Before Computing Estimated Sales
1-Cluster Locations:
Use a clustering algorithm (DBSCAN, OPTICS, HDBSCAN) to group locations based on their geographic coordinates (latitude and longitude) with a distance parameter set to 5 km.
This will help identify clusters of locations that are within 5 km of each other.
2-Compute Combined Sales for Each Cluster:
For each cluster, sum the estimated sales of all locations within that cluster.
Assign the combined sales to a representative location (e.g., the centroid of the cluster or the location with the highest sales within the cluster).
3-Evaluate Clusters Against Operational Costs:
Determine if the combined sales for each cluster meet or exceed the operational costs.
Select the clusters (or representative locations) that achieve the best coverage and sales.
Ball Tree Construction Steps:






Copy Version For Ball Tree Algorthim
dataFrameTypeABall=dataFrameTypeA
dataFrameTypeBBall=dataFrameTypeB
Recommendation-A Calculation
indices,distances=ballTreeCall(dataFrameTypeABall,dataFrameTypeABall)
updatedEstimastedSales=[round(sum([dataFrameTypeABall.iloc[idx]["Estimated_Sales"] for idx in indices[i]]),3) for i in range(len(dataFrameTypeABall))]
dataFrameTypeABall["updatedEstimastedSales"]=updatedEstimastedSales
Recommendation-B Calculation
indices,distances=ballTreeCall(dataFrameTypeABall,dataFrameTypeBBall)
estimated_sales=[]
for i in range(len(dataFrameTypeBBall)):
if len(indices[i])>0:
estimated_sales.append(round(dataFrameTypeABall.iloc[indices[i][0]]["updatedEstimastedSales"],3))
else:
estimated_sales.append(0.0)
dataFrameTypeBBall["updatedEstimastedSales"]=estimated_sales
Combine All Location Togheter To select from them
combinedDfBall = pd.concat([dataFrameTypeABall, dataFrameTypeBBall], ignore_index=True)
Apply Fiter for not take Locations that EstimatedSales<Operation cost
displayRemovedLocations(combinedDfBall)
len(combinedDfBall[combinedDfBall["updatedEstimastedSales"]<combinedDfBall["operational_costs"]])
2024
Remove Location That their costs greater than their sales
combinedDfBall=combinedDfBall[combinedDfBall["updatedEstimastedSales"]>combinedDfBall["operational_costs"]]
Create Column For Coverage
combinedDfBall["Coverage"]=combinedDfBall["updatedEstimastedSales"]-combinedDfBall["operational_costs"]
combinedDfBall
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | الشرقية | حضر | 30.727543 | 31.793701 | 72021.026667 | A | 25000 | 96084.883 | 71084.883 |
| 1 | 2 | القاهرة | حضر | 30.235685 | 31.467567 | 54690.637778 | A | 25000 | 57434.150 | 32434.150 |
| 2 | 3 | المنوفية | حضر | 30.361963 | 30.512031 | 42379.708889 | A | 25000 | 51029.819 | 26029.819 |
| 3 | 4 | أسوان | حضر | 24.977806 | 32.875934 | 42377.312222 | A | 25000 | 63934.114 | 38934.114 |
| 4 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 | A | 25000 | 68756.226 | 43756.226 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2663 | 128 | القليوبية | NaN | 30.145500 | 31.215355 | NaN | B | 15000 | 47367.362 | 32367.362 |
| 2664 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996 | 94193.996 |
| 2665 | 130 | القليوبية | NaN | 30.186000 | 31.353272 | NaN | B | 15000 | 30255.746 | 15255.746 |
| 2666 | 131 | القليوبية | NaN | 30.138000 | 31.220216 | NaN | B | 15000 | 47367.362 | 32367.362 |
| 2667 | 132 | الاسماعيلية | NaN | 30.852917 | 32.306078 | NaN | B | 15000 | 28021.987 | 13021.987 |
646 rows × 10 columns
Display Plots with the points contains distance 5km between each other
plotDataFrameLongLatWithType(combinedDfBall,showdistance=True)
We need to apply clustring algorthim or apply anyy algorthim that takes the hieghst coverage withing 5km range
Select the points that have heighst coverage and be sure that no point within it in 5km range so we can filter based on coverage and range

selectedPoints = combinedDfBrute.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBrute = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBrute
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996167 | 94193.996167 |
| 1 | 59 | الشرقية | NaN | 30.723517 | 31.793006 | NaN | B | 15000 | 96210.856667 | 81210.856667 |
| 2 | 116 | القليوبية | ريف | 30.155107 | 31.192262 | 12431.234444 | A | 25000 | 105264.344444 | 80264.344444 |
| 3 | 1946 | القليوبية | ريف | 30.284461 | 31.347094 | 168.683333 | A | 25000 | 94386.237344 | 69386.237344 |
| 4 | 343 | القليوبية | ريف | 30.217819 | 31.209151 | 4702.623333 | A | 25000 | 85800.951111 | 60800.951111 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 132 | 643 | الشرقية | ريف | 30.534998 | 31.717706 | 1864.231111 | A | 25000 | 25760.224444 | 760.224444 |
| 133 | 695 | قنا | ريف | 26.103429 | 32.439087 | 1614.741111 | A | 25000 | 25749.605556 | 749.605556 |
| 134 | 1593 | كفر الشيخ | ريف | 31.280505 | 31.165862 | 305.967778 | A | 25000 | 25678.984444 | 678.984444 |
| 135 | 452 | المنوفية | ريف | 30.516160 | 30.883343 | 3224.993333 | A | 25000 | 25649.080000 | 649.080000 |
| 136 | 1982 | اسيوط | ريف | 27.535805 | 30.793993 | 158.595556 | A | 25000 | 25604.987778 | 604.987778 |
137 rows × 10 columns
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfBrute)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBrute[resultDfBrute['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBrute[resultDfBrute['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBrute)}")
print(f"The Total Sales Equals:{resultDfBrute['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBrute['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBrute['operational_costs'].sum()}")
The Total Number of Location Selected:137 The Total Number of A-Location Selected:94 The Total Number of B-Location Selected:43 The Total Number of Locations Covered by selected:647 The Total Sales Equals:5601378.195133333 The Total Coverage Equals:2606378.1951333336 The Total Costs Equals:2995000
DisplayResults
plotDataFrameLongLatWithType(resultDfBrute,showdistance=True)
selectedPoints = combinedDfBall.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBall = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBall
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996 | 94193.996 |
| 1 | 59 | الشرقية | NaN | 30.723517 | 31.793006 | NaN | B | 15000 | 96210.857 | 81210.857 |
| 2 | 116 | القليوبية | ريف | 30.155107 | 31.192262 | 12431.234444 | A | 25000 | 105264.344 | 80264.344 |
| 3 | 1946 | القليوبية | ريف | 30.284461 | 31.347094 | 168.683333 | A | 25000 | 94386.237 | 69386.237 |
| 4 | 343 | القليوبية | ريف | 30.217819 | 31.209151 | 4702.623333 | A | 25000 | 85800.951 | 60800.951 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 131 | 643 | الشرقية | ريف | 30.534998 | 31.717706 | 1864.231111 | A | 25000 | 25760.224 | 760.224 |
| 132 | 695 | قنا | ريف | 26.103429 | 32.439087 | 1614.741111 | A | 25000 | 25749.606 | 749.606 |
| 133 | 1593 | كفر الشيخ | ريف | 31.280505 | 31.165862 | 305.967778 | A | 25000 | 25678.984 | 678.984 |
| 134 | 452 | المنوفية | ريف | 30.516160 | 30.883343 | 3224.993333 | A | 25000 | 25649.080 | 649.080 |
| 135 | 1982 | اسيوط | ريف | 27.535805 | 30.793993 | 158.595556 | A | 25000 | 25604.988 | 604.988 |
136 rows × 10 columns
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfBall)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBall[resultDfBall['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBall[resultDfBall['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBall)}")
print(f"The Total Sales Equals:{resultDfBall['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBall['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBall['operational_costs'].sum()}")
The Total Number of Location Selected:136 The Total Number of A-Location Selected:94 The Total Number of B-Location Selected:42 The Total Number of Locations Covered by selected:646 The Total Sales Equals:5579939.26 The Total Coverage Equals:2599939.2600000002 The Total Costs Equals:2980000
plotDataFrameLongLatWithType(resultDfBall,showdistance=True)
DBSCAN
DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses more on the proximity and density of observations to form clusters
Apply Clustring After perform Standrliztion and Convert to radians
srd=MinMaxScaler()
db = DBSCAN(eps=6/6371.0, min_samples=1, metric='haversine',algorithm='brute')
y_db = db.fit_predict(srd.fit_transform(np.radians(combinedDfBallDBSCN[["Longitude","Latitude"]])))
combinedDfBallDBSCN['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfBallDBSCN,clstr=True)
Select the Point from each cluster based on Distance wihtin same cluster and it's higher coverage amount
For each cluster we take the point that have hiegher coverage amount and wihtin 5km to all points in the cluster
selectedClustePoints = selectPoints(combinedDfBallDBSCN)
Select the point that heighst coverage within 5km range Apply Filter
selectedPoints = selectedClustePoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterDBSCN = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterDBSCN
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996 | 94193.996 | 20 |
| 1 | 58 | الشرقية | NaN | 30.732944 | 31.808439 | NaN | B | 15000 | 96210.857 | 81210.857 | 468 |
| 2 | 116 | القليوبية | ريف | 30.155107 | 31.192262 | 12431.234444 | A | 25000 | 105264.344 | 80264.344 | 73 |
| 3 | 1946 | القليوبية | ريف | 30.284461 | 31.347094 | 168.683333 | A | 25000 | 94386.237 | 69386.237 | 374 |
| 4 | 765 | الشرقية | ريف | 30.717965 | 31.756487 | 1339.983333 | A | 25000 | 91977.783 | 66977.783 | 203 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 133 | 643 | الشرقية | ريف | 30.534998 | 31.717706 | 1864.231111 | A | 25000 | 25760.224 | 760.224 | 179 |
| 134 | 695 | قنا | ريف | 26.103429 | 32.439087 | 1614.741111 | A | 25000 | 25749.606 | 749.606 | 186 |
| 135 | 1593 | كفر الشيخ | ريف | 31.280505 | 31.165862 | 305.967778 | A | 25000 | 25678.984 | 678.984 | 316 |
| 136 | 452 | المنوفية | ريف | 30.516160 | 30.883343 | 3224.993333 | A | 25000 | 25649.080 | 649.080 | 155 |
| 137 | 1982 | اسيوط | ريف | 27.535805 | 30.793993 | 158.595556 | A | 25000 | 25604.988 | 604.988 | 378 |
138 rows × 11 columns
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfBallClusterDBSCN)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterDBSCN[resultDfBallClusterDBSCN['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterDBSCN[resultDfBallClusterDBSCN['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallDBSCN)}")
print(f"The Total Sales Equals:{resultDfBallClusterDBSCN['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterDBSCN['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterDBSCN['operational_costs'].sum()}")
The Total Number of Location Selected:138 The Total Number of A-Location Selected:96 The Total Number of B-Location Selected:42 The Total Number of Locations Covered by selected:646 The Total Sales Equals:5701418.605 The Total Coverage Equals:2671418.605 The Total Costs Equals:3030000
plotDataFrameLongLatWithType(resultDfBallClusterDBSCN,showdistance=True,showNumber=True)
OPTICS
OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm, similar to DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Core Distance: It is the minimum value of radius required to classify a given point as a core point. If the given point is not a Core point, then it’s Core Distance is undefined.
Reachability Distance: It is defined with respect to another data point q(Let). The Reachability distance between a point p and q is the maximum of the Core Distance of p and the Euclidean Distance(or some other distance metric) between p and q. Note that The Reachability Distance is not defined if q is not a Core point.
Cluster extraction:OPTICS produces a reachability distance plot that can be used to extract clusters at different levels of granularity. This allows for more flexible clustering and can reveal clusters that may not be apparent with a fixed epsilon value in DBSCAN. However, this also requires more manual interpretation and decision-making on the part of the programmer.
Noise handling: DBSCAN explicitly distinguishes between core points, boundary points, and noise points, while OPTICS does not explicitly identify noise points. Instead, points with high reachability distances can be considered as potential noise points. However, this also means that OPTICS may be less effective at identifying small clusters that are surrounded by noise points, as these clusters may be merged with the noise points in the reachability distance plot.
Runtime complexity: The runtime complexity of OPTICS is generally higher than that of DBSCAN, due to the use of a priority queue to maintain the reachability distances. However, recent research has proposed optimizations to reduce the computational complexity of OPTICS, making it more scalable for large datasets.
optics = OPTICS(min_samples=2, eps=5, metric='haversine',algorithm="brute")
y_db = optics.fit_predict(srd.fit_transform(np.radians(combinedDfBallOPTICS[["Longitude","Latitude"]])))
combinedDfBallOPTICS['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfBallOPTICS,clstr=True)
noiseData=combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]==-1]
startcount=max(combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]!=-1], noiseData], ignore_index=True))
selectedPoints
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | الشرقية | NaN | 30.732944 | 31.808439 | NaN | B | 15000 | 96210.857 | 81210.857 | 0 |
| 1 | 1637 | الشرقية | ريف | 30.685039 | 31.780953 | 282.131111 | A | 25000 | 89609.278 | 64609.278 | 1 |
| 2 | 1487 | الشرقية | ريف | 30.731698 | 31.843480 | 369.911111 | A | 25000 | 103210.140 | 78210.140 | 2 |
| 3 | 866 | الشرقية | ريف | 30.721617 | 31.702214 | 1066.661111 | A | 25000 | 59119.196 | 34119.196 | 3 |
| 4 | 63 | الشرقية | NaN | 30.722667 | 31.664328 | NaN | B | 15000 | 54942.319 | 39942.319 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 283 | 71 | المنوفية | NaN | 30.468696 | 30.934389 | NaN | B | 15000 | 44173.571 | 29173.571 | 283 |
| 284 | 72 | المنوفية | NaN | 30.460861 | 30.938333 | NaN | B | 15000 | 44173.571 | 29173.571 | 284 |
| 285 | 81 | المنوفية | NaN | 30.428361 | 31.034500 | NaN | B | 15000 | 29824.110 | 14824.110 | 285 |
| 286 | 107 | الدقهلية | NaN | 31.093528 | 31.305411 | NaN | B | 15000 | 18315.163 | 3315.163 | 286 |
| 287 | 111 | القليوبية | NaN | 30.226889 | 31.365677 | NaN | B | 15000 | 58432.181 | 43432.181 | 287 |
288 rows × 11 columns
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterOPTICS = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterOPTICS
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996 | 94193.996 | 61 |
| 1 | 58 | الشرقية | NaN | 30.732944 | 31.808439 | NaN | B | 15000 | 96210.857 | 81210.857 | 0 |
| 2 | 116 | القليوبية | ريف | 30.155107 | 31.192262 | 12431.234444 | A | 25000 | 105264.344 | 80264.344 | 52 |
| 3 | 1946 | القليوبية | ريف | 30.284461 | 31.347094 | 168.683333 | A | 25000 | 94386.237 | 69386.237 | 64 |
| 4 | 765 | الشرقية | ريف | 30.717965 | 31.756487 | 1339.983333 | A | 25000 | 91977.783 | 66977.783 | 200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | 643 | الشرقية | ريف | 30.534998 | 31.717706 | 1864.231111 | A | 25000 | 25760.224 | 760.224 | 190 |
| 117 | 695 | قنا | ريف | 26.103429 | 32.439087 | 1614.741111 | A | 25000 | 25749.606 | 749.606 | 192 |
| 118 | 452 | المنوفية | ريف | 30.516160 | 30.883343 | 3224.993333 | A | 25000 | 25649.080 | 649.080 | 182 |
| 119 | 1982 | اسيوط | ريف | 27.535805 | 30.793993 | 158.595556 | A | 25000 | 25604.988 | 604.988 | 249 |
| 120 | 67 | كفر الشيخ | حضر | 31.309466 | 31.150495 | 16794.541111 | A | 25000 | 25348.356 | 348.356 | 122 |
121 rows × 11 columns
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfBallClusterOPTICS)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterOPTICS[resultDfBallClusterOPTICS['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterOPTICS[resultDfBallClusterOPTICS['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallOPTICS)}")
print(f"The Total Sales Equals:{resultDfBallClusterOPTICS['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterOPTICS['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterOPTICS['operational_costs'].sum()}")
The Total Number of Location Selected:121 The Total Number of A-Location Selected:82 The Total Number of B-Location Selected:39 The Total Number of Locations Covered by selected:646 The Total Sales Equals:5181672.782999999 The Total Coverage Equals:2546672.7830000003 The Total Costs Equals:2635000
DisplayResults
plotDataFrameLongLatWithType(resultDfBallClusterOPTICS,clstr=False,showdistance=True,showNumber=True)
HDBSCAN
Estimating densities We need some method to estimate the density around certain points. One common way to do this is by using “core distance.” This is the distance of a point to its K-th nearest neighbor.
Points in denser regions would have smaller core distances while points in sparser regions would have larger core distances. Core distance is what makes these methods “density-based”.
Simple Cluster Selection
One way to select clusters is to pick a global threshold. By getting the points with densities above the threshold, and grouping these points together, we get our clusters.
hdbscn = hdbscan.HDBSCAN(min_cluster_size=2,min_samples=1, cluster_selection_epsilon=5/6371.0, metric='haversine')
y_db = hdbscn.fit_predict(srd.fit_transform(np.radians(combinedDfBallHDBSCN[["Longitude","Latitude"]])))
combinedDfBallHDBSCN['cluster'] = y_db
noiseData=combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]==-1]
startcount=max(combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]!=-1], noiseData], ignore_index=True))
selectedPoints
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 317 | مرسى مطروح | حضر | 31.352412 | 27.255389 | 5367.276667 | A | 25000 | 84420.660 | 59420.660 | 0 |
| 1 | 112 | مرسى مطروح | حضر | 31.327339 | 27.265570 | 12714.126667 | A | 25000 | 84420.660 | 59420.660 | 1 |
| 2 | 1194 | جنوب سيناء | حضر | 27.892750 | 34.298750 | 578.293333 | A | 25000 | 42917.283 | 17917.283 | 2 |
| 3 | 41 | البحر الاحمر | حضر | 28.340011 | 33.086865 | 21800.420000 | A | 25000 | 29658.218 | 4658.218 | 3 |
| 4 | 30 | شمال سيناء | حضر | 31.126209 | 33.808316 | 25193.351111 | A | 25000 | 27226.639 | 2226.639 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | 46 | القاهرة | NaN | 30.164937 | 31.337363 | NaN | B | 15000 | 19896.361 | 4896.361 | 233 |
| 234 | 51 | القاهرة | NaN | 30.165917 | 31.605466 | NaN | B | 15000 | 17900.754 | 2900.754 | 234 |
| 235 | 56 | كفر الشيخ | NaN | 31.274111 | 30.788123 | NaN | B | 15000 | 18797.744 | 3797.744 | 235 |
| 236 | 107 | الدقهلية | NaN | 31.093528 | 31.305411 | NaN | B | 15000 | 18315.163 | 3315.163 | 236 |
| 237 | 111 | القليوبية | NaN | 30.226889 | 31.365677 | NaN | B | 15000 | 58432.181 | 43432.181 | 237 |
238 rows × 11 columns
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterHDBSCN = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterHDBSCN
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | Coverage | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 109193.996 | 94193.996 | 146 |
| 1 | 58 | الشرقية | NaN | 30.732944 | 31.808439 | NaN | B | 15000 | 96210.857 | 81210.857 | 87 |
| 2 | 116 | القليوبية | ريف | 30.155107 | 31.192262 | 12431.234444 | A | 25000 | 105264.344 | 80264.344 | 154 |
| 3 | 1946 | القليوبية | ريف | 30.284461 | 31.347094 | 168.683333 | A | 25000 | 94386.237 | 69386.237 | 140 |
| 4 | 765 | الشرقية | ريف | 30.717965 | 31.756487 | 1339.983333 | A | 25000 | 91977.783 | 66977.783 | 182 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114 | 643 | الشرقية | ريف | 30.534998 | 31.717706 | 1864.231111 | A | 25000 | 25760.224 | 760.224 | 176 |
| 115 | 695 | قنا | ريف | 26.103429 | 32.439087 | 1614.741111 | A | 25000 | 25749.606 | 749.606 | 178 |
| 116 | 452 | المنوفية | ريف | 30.516160 | 30.883343 | 3224.993333 | A | 25000 | 25649.080 | 649.080 | 169 |
| 117 | 1982 | اسيوط | ريف | 27.535805 | 30.793993 | 158.595556 | A | 25000 | 25604.988 | 604.988 | 217 |
| 118 | 67 | كفر الشيخ | حضر | 31.309466 | 31.150495 | 16794.541111 | A | 25000 | 25348.356 | 348.356 | 38 |
119 rows × 11 columns
DisplayResults
plotDataFrameLongLatWithType(resultDfBallClusterHDBSCN,clstr=False,showdistance=True,showNumber=True)
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfBallClusterHDBSCN)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterHDBSCN[resultDfBallClusterHDBSCN['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterHDBSCN[resultDfBallClusterHDBSCN['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallHDBSCN)}")
print(f"The Total Sales Equals:{resultDfBallClusterHDBSCN['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterHDBSCN['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterHDBSCN['operational_costs'].sum()}")
The Total Number of Location Selected:119 The Total Number of A-Location Selected:80 The Total Number of B-Location Selected:39 The Total Number of Locations Covered by selected:646 The Total Sales Equals:5101434.974999999 The Total Coverage Equals:2516434.975000001 The Total Costs Equals:2585000
OptionOneResult
optionOneDFResult
| choices | # of-Loc | # of-A-Loc | # of-B-Loc | # of-LocCover | TotalSales | TotalCoverage | TotalCosts | |
|---|---|---|---|---|---|---|---|---|
| 0 | HDBSCAN | 119 | 80 | 39 | 646 | 5101434.975 | 2516434.975 | 2585000 |
| 1 | DBSCAN | 138 | 96 | 42 | 646 | 5701418.605 | 2671418.605 | 3030000 |
| 2 | OPTICS | 121 | 82 | 39 | 646 | 5181672.783 | 2546672.783 | 2635000 |
| 3 | BallTree | 136 | 94 | 42 | 646 | 5579939.260 | 2599939.260 | 2980000 |
| 4 | Brute | 137 | 94 | 43 | 647 | 5601378.195 | 2606378.195 | 2995000 |
combinedDFOption2=pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True)
DBSCAN
srd=MinMaxScaler()
db = DBSCAN(eps=6/6371.0, min_samples=1, metric='haversine',algorithm='brute')
y_db = db.fit_predict(srd.fit_transform(np.radians(combinedDfDBSCNOPT2[["Longitude","Latitude"]])))
combinedDfDBSCNOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfDBSCNOPT2,clstr=True)
combinedDfDBSCNOPT2['updatedEstimastedSales'] = combinedDfDBSCNOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
combinedDfDBSCNOPT2=combinedDfDBSCNOPT2[combinedDfDBSCNOPT2["updatedEstimastedSales"]>combinedDfDBSCNOPT2["operational_costs"]]
combinedDfDBSCNOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | الشرقية | حضر | 30.727543 | 31.793701 | 72021.026667 | A | 25000 | 85931.800000 | 0 |
| 1 | 2 | القاهرة | حضر | 30.235685 | 31.467567 | 54690.637778 | A | 25000 | 54690.637778 | 1 |
| 2 | 3 | المنوفية | حضر | 30.361963 | 30.512031 | 42379.708889 | A | 25000 | 42379.708889 | 2 |
| 3 | 4 | أسوان | حضر | 24.977806 | 32.875934 | 42377.312222 | A | 25000 | 56644.472222 | 3 |
| 4 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 | A | 25000 | 41380.507778 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2660 | 125 | القليوبية | NaN | 30.310222 | 31.323466 | NaN | B | 15000 | 65058.061789 | 20 |
| 2663 | 128 | القليوبية | NaN | 30.145500 | 31.215355 | NaN | B | 15000 | 31718.676667 | 12 |
| 2664 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 65058.061789 | 20 |
| 2666 | 131 | القليوبية | NaN | 30.138000 | 31.220216 | NaN | B | 15000 | 31718.676667 | 12 |
| 2667 | 132 | الاسماعيلية | NaN | 30.852917 | 32.306078 | NaN | B | 15000 | 16944.621111 | 60 |
169 rows × 10 columns
combinedDfDBSCNOPT2["Coverage"]=combinedDfDBSCNOPT2["updatedEstimastedSales"]-combinedDfDBSCNOPT2["operational_costs"]
selectedPoints = selectPoints(combinedDfDBSCNOPT2)
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfDBSCNOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfDBSCNOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | الشرقية | NaN | 30.733709 | 31.791613 | NaN | B | 15000 | 85931.800000 | 0 | 70931.800000 |
| 1 | 18 | أسوان | حضر | 24.089784 | 32.898154 | 30039.402222 | A | 25000 | 83389.173333 | 17 | 58389.173333 |
| 2 | 123 | القليوبية | NaN | 30.313824 | 31.318385 | NaN | B | 15000 | 65058.061789 | 20 | 50058.061789 |
| 3 | 30 | المنيا | NaN | 28.647222 | 30.843517 | NaN | B | 15000 | 62077.036667 | 14 | 47077.036667 |
| 4 | 33 | المنيا | NaN | 28.316419 | 30.705637 | NaN | B | 15000 | 59845.363333 | 13 | 44845.363333 |
| 5 | 63 | الشرقية | NaN | 30.722667 | 31.664328 | NaN | B | 15000 | 53305.933333 | 5 | 38305.933333 |
| 6 | 115 | القليوبية | NaN | 30.180428 | 31.217045 | NaN | B | 15000 | 49031.280000 | 7 | 34031.280000 |
| 7 | 4 | أسوان | حضر | 24.977806 | 32.875934 | 42377.312222 | A | 25000 | 56644.472222 | 3 | 31644.472222 |
| 8 | 62 | الشرقية | NaN | 30.523139 | 31.348689 | NaN | B | 15000 | 45520.860000 | 10 | 30520.860000 |
| 9 | 2 | القاهرة | حضر | 30.235685 | 31.467567 | 54690.637778 | A | 25000 | 54690.637778 | 1 | 29690.637778 |
| 10 | 93 | البحيرة | NaN | 30.890535 | 30.663745 | NaN | B | 15000 | 41950.432222 | 8 | 26950.432222 |
| 11 | 77 | المنوفية | NaN | 30.298833 | 30.983217 | NaN | B | 15000 | 37151.773333 | 39 | 22151.773333 |
| 12 | 29 | المنيا | NaN | 27.736193 | 30.834816 | NaN | B | 15000 | 35794.327778 | 29 | 20794.327778 |
| 13 | 105 | الدقهلية | NaN | 31.194083 | 31.528133 | NaN | B | 15000 | 34270.371111 | 11 | 19270.371111 |
| 14 | 53 | الغربية | NaN | 30.939739 | 30.806615 | NaN | B | 15000 | 33141.896667 | 22 | 18141.896667 |
| 15 | 3 | المنوفية | حضر | 30.361963 | 30.512031 | 42379.708889 | A | 25000 | 42379.708889 | 2 | 17379.708889 |
| 16 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 | A | 25000 | 41380.507778 | 4 | 16380.507778 |
| 17 | 7 | القليوبية | ريف | 30.334697 | 31.268586 | 40289.102156 | A | 25000 | 40289.102156 | 6 | 15289.102156 |
| 18 | 102 | الدقهلية | NaN | 31.090019 | 31.597546 | NaN | B | 15000 | 28787.932222 | 23 | 13787.932222 |
| 19 | 92 | البحيرة | NaN | 30.762731 | 30.694217 | NaN | B | 15000 | 27990.341111 | 38 | 12990.341111 |
| 20 | 98 | البحيرة | NaN | 30.830361 | 30.539272 | NaN | B | 15000 | 27810.555922 | 49 | 12810.555922 |
| 21 | 113 | القليوبية | NaN | 30.245453 | 31.353542 | NaN | B | 15000 | 27640.241111 | 78 | 12640.241111 |
| 22 | 22 | اسيوط | حضر | 27.061354 | 31.340125 | 27662.900000 | A | 25000 | 37184.572222 | 21 | 12184.572222 |
| 23 | 96 | البحيرة | NaN | 31.030694 | 30.715550 | NaN | B | 15000 | 26287.638889 | 30 | 11287.638889 |
| 24 | 66 | الشرقية | NaN | 30.791750 | 31.620855 | NaN | B | 15000 | 26143.969833 | 24 | 11143.969833 |
| 25 | 44 | بني سويف | حضر | 28.925050 | 30.855897 | 20982.414444 | A | 25000 | 36143.598889 | 40 | 11143.598889 |
| 26 | 9 | الاسكندرية | NaN | 31.159778 | 29.869967 | NaN | B | 15000 | 26005.710000 | 36 | 11005.710000 |
| 27 | 10 | جنوب سيناء | حضر | 27.860291 | 34.303116 | 34468.973333 | A | 25000 | 34843.250000 | 9 | 9843.250000 |
| 28 | 61 | الشرقية | NaN | 30.535955 | 31.677756 | NaN | B | 15000 | 23483.897778 | 50 | 8483.897778 |
| 29 | 55 | الغربية | NaN | 30.729556 | 31.117917 | NaN | B | 15000 | 23199.595556 | 92 | 8199.595556 |
| 30 | 45 | القاهرة | NaN | 30.469000 | 30.939133 | NaN | B | 15000 | 22645.623333 | 35 | 7645.623333 |
| 31 | 16 | الدقهلية | حضر | 31.166643 | 31.803276 | 30836.737778 | A | 25000 | 32433.765556 | 15 | 7433.765556 |
| 32 | 4 | الفيوم | NaN | 29.361996 | 30.683888 | NaN | B | 15000 | 21962.411111 | 51 | 6962.411111 |
| 33 | 52 | الغربية | NaN | 30.824112 | 30.810771 | NaN | B | 15000 | 21934.857778 | 79 | 6934.857778 |
| 34 | 20 | اسيوط | حضر | 27.443933 | 30.830140 | 29299.514444 | A | 25000 | 31612.167778 | 19 | 6612.167778 |
| 35 | 60 | الشرقية | NaN | 30.750426 | 31.452409 | NaN | B | 15000 | 20863.726667 | 120 | 5863.726667 |
| 36 | 17 | الدقهلية | حضر | 30.881879 | 31.454996 | 30526.830000 | A | 25000 | 30526.830000 | 16 | 5526.830000 |
| 37 | 34 | المنيا | NaN | 28.082083 | 30.763189 | NaN | B | 15000 | 20296.282222 | 48 | 5296.282222 |
| 38 | 70 | الجيزة | ريف | 29.775070 | 31.267214 | 16106.003333 | A | 25000 | 30083.615556 | 64 | 5083.615556 |
| 39 | 19 | دمياط | حضر | 31.354685 | 31.685219 | 29965.090000 | A | 25000 | 29965.090000 | 18 | 4965.090000 |
| 40 | 35 | المنيا | NaN | 28.696693 | 30.784010 | NaN | B | 15000 | 19644.896667 | 46 | 4644.896667 |
| 41 | 68 | المنوفية | NaN | 30.426583 | 31.044210 | NaN | B | 15000 | 19487.016667 | 86 | 4487.016667 |
| 42 | 6 | الفيوم | NaN | 29.474248 | 30.953917 | NaN | B | 15000 | 19153.417833 | 69 | 4153.417833 |
| 43 | 100 | البحيرة | NaN | 30.910763 | 30.173929 | NaN | B | 15000 | 17994.738889 | 55 | 2994.738889 |
| 44 | 51 | القاهرة | NaN | 30.165917 | 31.605466 | NaN | B | 15000 | 17900.754444 | 52 | 2900.754444 |
| 45 | 84 | المنوفية | NaN | 30.549361 | 31.036855 | NaN | B | 15000 | 17693.730000 | 95 | 2693.730000 |
| 46 | 38 | اسيوط | حضر | 27.264926 | 31.154670 | 22491.631111 | A | 25000 | 27147.948889 | 34 | 2147.948889 |
| 47 | 132 | الاسماعيلية | NaN | 30.852917 | 32.306078 | NaN | B | 15000 | 16944.621111 | 60 | 1944.621111 |
| 48 | 30 | شمال سيناء | حضر | 31.126209 | 33.808316 | 25193.351111 | A | 25000 | 26607.242222 | 26 | 1607.242222 |
| 49 | 28 | الدقهلية | حضر | 31.159029 | 31.936320 | 25787.155556 | A | 25000 | 25787.155556 | 25 | 787.155556 |
| 50 | 107 | الدقهلية | NaN | 31.093528 | 31.305411 | NaN | B | 15000 | 15635.557778 | 71 | 635.557778 |
| 51 | 31 | القاهرة | حضر | 29.847197 | 31.377750 | 24386.654444 | A | 25000 | 25488.334444 | 27 | 488.334444 |
| 52 | 25 | بني سويف | NaN | 29.342944 | 31.208105 | NaN | B | 15000 | 15339.275556 | 214 | 339.275556 |
DisplayResults
plotDataFrameLongLatWithType(resultDfDBSCNOPT2,showdistance=True,showNumber=True)
BriefSummary
print(f"The Total Number of Location Selected:{len(resultDfDBSCNOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfDBSCNOPT2[resultDfDBSCNOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfDBSCNOPT2[resultDfDBSCNOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfDBSCNOPT2)}")
print(f"The Total Sales Equals:{resultDfDBSCNOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfDBSCNOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfDBSCNOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:53 The Total Number of A-Location Selected:18 The Total Number of B-Location Selected:35 The Total Number of Locations Covered by selected:169 The Total Sales Equals:1794523.101977778 The Total Coverage Equals:819523.1019777778 The Total Costs Equals:975000
OPTICS
optics = OPTICS(min_samples=2, eps=5, metric='haversine',algorithm="brute")
y_db = optics.fit_predict(srd.fit_transform(np.radians(combinedDfOPTICSOPT2[["Longitude","Latitude"]])))
combinedDfOPTICSOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfOPTICSOPT2,clstr=True)
combinedDfOPTICSOPT2['updatedEstimastedSales'] = combinedDfOPTICSOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
displayRemovedLocations(combinedDfOPTICSOPT2)
combinedDfOPTICSOPT2=combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["updatedEstimastedSales"]>combinedDfOPTICSOPT2["operational_costs"]]
combinedDfOPTICSOPT2["Coverage"]=combinedDfOPTICSOPT2["updatedEstimastedSales"]-combinedDfOPTICSOPT2["operational_costs"]
noiseData=combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]==-1]
startcount=max(combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]!=-1], noiseData], ignore_index=True))
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfOPTICSOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfOPTICSOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | الشرقية | NaN | 30.733709 | 31.791613 | NaN | B | 15000 | 85931.800000 | 0 | 70931.800000 |
| 1 | 63 | الشرقية | NaN | 30.722667 | 31.664328 | NaN | B | 15000 | 53305.933333 | 10 | 38305.933333 |
| 2 | 60 | الشرقية | NaN | 30.750426 | 31.452409 | NaN | B | 15000 | 22111.735556 | 36 | 7111.735556 |
| 3 | 62 | الشرقية | NaN | 30.523139 | 31.348689 | NaN | B | 15000 | 46646.990000 | 51 | 31646.990000 |
| 4 | 84 | المنوفية | NaN | 30.549361 | 31.036855 | NaN | B | 15000 | 17693.730000 | 99 | 2693.730000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 2492 | بني سويف | ريف | 29.029384 | 31.082819 | 40.915556 | A | 25000 | 817101.511389 | 1217 | 792101.511389 |
| 345 | 2507 | أسوان | حضر | 23.979483 | 32.896105 | 37.955556 | A | 25000 | 817101.511389 | 1219 | 792101.511389 |
| 346 | 2509 | قنا | ريف | 26.194544 | 32.554853 | 37.046667 | A | 25000 | 817101.511389 | 1220 | 792101.511389 |
| 347 | 2511 | المنيا | ريف | 28.454434 | 30.668412 | 36.696667 | A | 25000 | 817101.511389 | 1221 | 792101.511389 |
| 348 | 43 | المنيا | NaN | 28.093250 | 30.814605 | NaN | B | 15000 | 817101.511389 | 1227 | 802101.511389 |
349 rows × 11 columns
plotDataFrameLongLatWithType(resultDfOPTICSOPT2,showdistance=True,showNumber=True)
print(f"The Total Number of Location Selected:{len(resultDfOPTICSOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfOPTICSOPT2[resultDfOPTICSOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfOPTICSOPT2[resultDfOPTICSOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfOPTICSOPT2)}")
print(f"The Total Sales Equals:{resultDfOPTICSOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfOPTICSOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfOPTICSOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:349 The Total Number of A-Location Selected:311 The Total Number of B-Location Selected:38 The Total Number of Locations Covered by selected:708 The Total Sales Equals:238218585.73551106 The Total Coverage Equals:229873585.73551106 The Total Costs Equals:8345000
HDBSCAN
combinedDfHDBSCNOPT2=combinedDFOption2
hdbscn = hdbscan.HDBSCAN(min_cluster_size=2,min_samples=1, cluster_selection_epsilon=5/6371.0, metric='haversine')
y_db = hdbscn.fit_predict(srd.fit_transform(np.radians(combinedDfHDBSCNOPT2[["Longitude","Latitude"]])))
combinedDfHDBSCNOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfHDBSCNOPT2,clstr=True)
combinedDfHDBSCNOPT2['updatedEstimastedSales'] = combinedDfHDBSCNOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
displayRemovedLocations(combinedDfHDBSCNOPT2)
combinedDfHDBSCNOPT2=combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["updatedEstimastedSales"]>combinedDfHDBSCNOPT2["operational_costs"]]
combinedDfHDBSCNOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | الشرقية | حضر | 30.727543 | 31.793701 | 72021.026667 | A | 25000 | 91445.203333 | 593 |
| 1 | 2 | القاهرة | حضر | 30.235685 | 31.467567 | 54690.637778 | A | 25000 | 57434.150000 | 56 |
| 2 | 3 | المنوفية | حضر | 30.361963 | 30.512031 | 42379.708889 | A | 25000 | 51029.818889 | 31 |
| 3 | 4 | أسوان | حضر | 24.977806 | 32.875934 | 42377.312222 | A | 25000 | 62155.374444 | 101 |
| 4 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 | A | 25000 | 46747.784444 | 28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2660 | 125 | القليوبية | NaN | 30.310222 | 31.323466 | NaN | B | 15000 | 66996.411789 | 669 |
| 2663 | 128 | القليوبية | NaN | 30.145500 | 31.215355 | NaN | B | 15000 | 31718.676667 | 674 |
| 2664 | 129 | القليوبية | NaN | 30.318393 | 31.310492 | NaN | B | 15000 | 66996.411789 | 669 |
| 2666 | 131 | القليوبية | NaN | 30.138000 | 31.220216 | NaN | B | 15000 | 31718.676667 | 674 |
| 2667 | 132 | الاسماعيلية | NaN | 30.852917 | 32.306078 | NaN | B | 15000 | 31237.639367 | 42 |
598 rows × 10 columns
combinedDfHDBSCNOPT2["Coverage"]=combinedDfHDBSCNOPT2["updatedEstimastedSales"]-combinedDfHDBSCNOPT2["operational_costs"]
noiseData=combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]==-1]
startcount=max(combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]!=-1], noiseData], ignore_index=True))
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfHDBSCNOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfHDBSCNOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | شمال سيناء | حضر | 31.126209 | 33.808316 | 25193.351111 | A | 25000 | 27797.405556 | 12 | 2797.405556 |
| 1 | 41 | البحر الاحمر | حضر | 28.340011 | 33.086865 | 21800.420000 | A | 25000 | 29658.217778 | 20 | 4658.217778 |
| 2 | 10 | جنوب سيناء | حضر | 27.860291 | 34.303116 | 34468.973333 | A | 25000 | 42950.647778 | 22 | 17950.647778 |
| 3 | 77 | مرسى مطروح | حضر | 31.318053 | 27.287814 | 15664.434444 | A | 25000 | 37672.875556 | 27 | 12672.875556 |
| 4 | 5 | مرسى مطروح | حضر | 31.355745 | 27.244068 | 40826.774444 | A | 25000 | 46747.784444 | 28 | 21747.784444 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 289 | 2430 | الشرقية | ريف | 30.693489 | 31.739608 | 50.696667 | A | 25000 | 624781.123611 | 1042 | 599781.123611 |
| 290 | 2488 | الدقهلية | ريف | 30.834214 | 31.263532 | 41.683333 | A | 25000 | 624781.123611 | 1046 | 599781.123611 |
| 291 | 2507 | أسوان | حضر | 23.979483 | 32.896105 | 37.955556 | A | 25000 | 624781.123611 | 1048 | 599781.123611 |
| 292 | 2511 | المنيا | ريف | 28.454434 | 30.668412 | 36.696667 | A | 25000 | 624781.123611 | 1049 | 599781.123611 |
| 293 | 43 | المنيا | NaN | 28.093250 | 30.814605 | NaN | B | 15000 | 624781.123611 | 1052 | 609781.123611 |
294 rows × 11 columns
plotDataFrameLongLatWithType(resultDfHDBSCNOPT2,showdistance=True,showNumber=True)
print(f"The Total Number of Location Selected:{len(resultDfHDBSCNOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfHDBSCNOPT2[resultDfHDBSCNOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfHDBSCNOPT2[resultDfHDBSCNOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfHDBSCNOPT2)}")
print(f"The Total Sales Equals:{resultDfHDBSCNOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfHDBSCNOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfHDBSCNOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:294 The Total Number of A-Location Selected:253 The Total Number of B-Location Selected:41 The Total Number of Locations Covered by selected:598 The Total Sales Equals:147115382.07745552 The Total Coverage Equals:140175382.07745552 The Total Costs Equals:6940000
FinalResult=pd.concat([optionOneDFResult, optionTwoDFResult],ignore_index=True)
FinalResult
| choices | # of-Loc | # of-A-Loc | # of-B-Loc | # of-LocCover | TotalSales | TotalCoverage | TotalCosts | |
|---|---|---|---|---|---|---|---|---|
| 0 | HDBSCAN | 119 | 80 | 39 | 646 | 5.101435e+06 | 2.516435e+06 | 2585000 |
| 1 | DBSCAN | 138 | 96 | 42 | 646 | 5.701419e+06 | 2.671419e+06 | 3030000 |
| 2 | OPTICS | 121 | 82 | 39 | 646 | 5.181673e+06 | 2.546673e+06 | 2635000 |
| 3 | BallTree | 136 | 94 | 42 | 646 | 5.579939e+06 | 2.599939e+06 | 2980000 |
| 4 | Brute | 137 | 94 | 43 | 647 | 5.601378e+06 | 2.606378e+06 | 2995000 |
| 5 | HDBSCAN-OPT2 | 294 | 253 | 41 | 598 | 1.471154e+08 | 1.401754e+08 | 6940000 |
| 6 | DBSCAN-OPT2 | 53 | 18 | 35 | 169 | 1.794523e+06 | 8.195231e+05 | 975000 |
| 7 | OPTICS-OPT2 | 349 | 311 | 38 | 708 | 2.382186e+08 | 2.298736e+08 | 8345000 |
FinalResult["ProfitMargin"]=FinalResult["TotalCoverage"]/FinalResult["TotalSales"]
FinalResult["CostPerLocation"]=FinalResult["TotalCosts"]/FinalResult["# of-Loc"]
FinalResult["LocationCoverage"]=FinalResult["# of-Loc"]/FinalResult["# of-LocCover"]
FinalResult["EfficiencyMetric"]=FinalResult["ProfitMargin"]*(1/FinalResult["LocationCoverage"])
FinalResult
| choices | # of-Loc | # of-A-Loc | # of-B-Loc | # of-LocCover | TotalSales | TotalCoverage | TotalCosts | ProfitMargin | CostPerLocation | LocationCoverage | EfficiencyMetric | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HDBSCAN | 119 | 80 | 39 | 646 | 5.101435e+06 | 2.516435e+06 | 2585000 | 0.493280 | 21722.689076 | 0.184211 | 2.677805 |
| 1 | DBSCAN | 138 | 96 | 42 | 646 | 5.701419e+06 | 2.671419e+06 | 3030000 | 0.468553 | 21956.521739 | 0.213622 | 2.193373 |
| 2 | OPTICS | 121 | 82 | 39 | 646 | 5.181673e+06 | 2.546673e+06 | 2635000 | 0.491477 | 21776.859504 | 0.187307 | 2.623918 |
| 3 | BallTree | 136 | 94 | 42 | 646 | 5.579939e+06 | 2.599939e+06 | 2980000 | 0.465944 | 21911.764706 | 0.210526 | 2.213234 |
| 4 | Brute | 137 | 94 | 43 | 647 | 5.601378e+06 | 2.606378e+06 | 2995000 | 0.465310 | 21861.313869 | 0.211747 | 2.197487 |
| 5 | HDBSCAN-OPT2 | 294 | 253 | 41 | 598 | 1.471154e+08 | 1.401754e+08 | 6940000 | 0.952826 | 23605.442177 | 0.491639 | 1.938061 |
| 6 | DBSCAN-OPT2 | 53 | 18 | 35 | 169 | 1.794523e+06 | 8.195231e+05 | 975000 | 0.456680 | 18396.226415 | 0.313609 | 1.456207 |
| 7 | OPTICS-OPT2 | 349 | 311 | 38 | 708 | 2.382186e+08 | 2.298736e+08 | 8345000 | 0.964969 | 23911.174785 | 0.492938 | 1.957588 |
Analysis
Profit Margin: Highest: OPTICS-OPT2 (0.964969) Second Highest: HDBSCAN-OPT2 (0.952826)
Total Coverage: Highest: OPTICS-OPT2 (2.298736e+08) Second Highest: HDBSCAN-OPT2 (1.401754e+08)
Efficiency Metric: Highest: HDBSCAN (2.677805) Second Highest: OPTICS (2.623918)
Cost Per Location: Lowest: DBSCAN-OPT2 (18396.226415) Second Lowest: HDBSCAN (21722.689076)
Conclusion
OPTICS-OPT2 emerges as the best overall choice due to its: Highest Profit Margin (0.964969) Highest Total Coverage (2.298736e+08)
While its Efficiency Metric (1.957588) is lower than that of HDBSCAN and OPTICS, the substantial profit margin and total coverage outweigh this factor. Additionally, its cost per location is relatively high, but this is compensated by the significantly higher coverage and profit margin.
resultDfOPTICSOPT2
| ID | Gov | City_Type | Latitude | Longitude | Estimated_Sales | Type | operational_costs | updatedEstimastedSales | cluster | Coverage | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | الشرقية | NaN | 30.733709 | 31.791613 | NaN | B | 15000 | 85931.800000 | 0 | 70931.800000 |
| 1 | 63 | الشرقية | NaN | 30.722667 | 31.664328 | NaN | B | 15000 | 53305.933333 | 10 | 38305.933333 |
| 2 | 60 | الشرقية | NaN | 30.750426 | 31.452409 | NaN | B | 15000 | 22111.735556 | 36 | 7111.735556 |
| 3 | 62 | الشرقية | NaN | 30.523139 | 31.348689 | NaN | B | 15000 | 46646.990000 | 51 | 31646.990000 |
| 4 | 84 | المنوفية | NaN | 30.549361 | 31.036855 | NaN | B | 15000 | 17693.730000 | 99 | 2693.730000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | 2492 | بني سويف | ريف | 29.029384 | 31.082819 | 40.915556 | A | 25000 | 817101.511389 | 1217 | 792101.511389 |
| 345 | 2507 | أسوان | حضر | 23.979483 | 32.896105 | 37.955556 | A | 25000 | 817101.511389 | 1219 | 792101.511389 |
| 346 | 2509 | قنا | ريف | 26.194544 | 32.554853 | 37.046667 | A | 25000 | 817101.511389 | 1220 | 792101.511389 |
| 347 | 2511 | المنيا | ريف | 28.454434 | 30.668412 | 36.696667 | A | 25000 | 817101.511389 | 1221 | 792101.511389 |
| 348 | 43 | المنيا | NaN | 28.093250 | 30.814605 | NaN | B | 15000 | 817101.511389 | 1227 | 802101.511389 |
349 rows × 11 columns
plotDataFrameLongLatWithType(resultDfOPTICSOPT2,showdistance=True,showNumber=True)